Ludwig - Papers: Statistics - Machine Learning

Fantastic Pretraining Optimizers and Where to Find Them

Kaiyue Wen, et al. • (2025) • DOI: 10.48550/arXiv.2509.02046

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings hav...

DataRater: Meta-Learned Dataset Curation

Dan A. Calian, et al. • (2025) • DOI: 10.48550/arXiv.2505.17895

The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mi...

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Adam Roberts, et al. • (2020) • DOI: 10.48550/arXiv.2002.08910

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the p...

General agents need world models

Jonathan Richens, et al. • (2025) • DOI: 10.48550/arXiv.2506.01622

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of gene...

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstei..., et al. • (2015) • DOI: 10.48550/arXiv.1503.03585

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still a...

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Michael M. Bronstein, et al. • (2021) • DOI: 10.48550/arXiv.2104.13478

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...

Trade-offs in Data Memorization via Strong Data Processing Inequalities

Vitaly Feldman, et al. • (2025) • DOI: 10.48550/arXiv.2506.01855

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sen...

Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

Artem Riabinin, et al. • (2025) • DOI: 10.48550/arXiv.2505.13416

Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a ...

Iteratively reweighted kernel machines efficiently learn sparse functions

Libin Zhu, et al. • (2025) • DOI: 10.48550/arXiv.2505.08277

The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, ...

Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features

Adityanarayanan Radh..., et al. • (2023) • DOI: 10.48550/arXiv.2212.13881

In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in...

Denoising Diffusion Probabilistic Models

Jonathan Ho, et al. • (2020) • DOI: 10.48550/arXiv.2006.11239

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...

Similarity of Neural Network Representations Revisited

Simon Kornblith, et al. • (2019) • DOI: 10.48550/arXiv.1905.00414

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network r...

A mathematical theory of semantic development in deep neural networks

Andrew M. Saxe, et al. • Proceedings of the National Academy of Sciences • (2019) • DOI: 10.1073/pnas.1820226116

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...
